Introduction

This report is the first to document and study the feasibility of automatically evaluating the quality of experimental literature on bio–nano interactions. The first step of this automatic evaluation is to isolate the Materials and Methods section. The goal is to later use this section alone to assess whether the nanomaterials are characterised and to evaluate the quality of the articles.

This report contains preliminary analyses and an exploration of the data in the text corpus. The first goal of these analyses is to understand the structure of the texts in the corpus of articles and how the lemmas “material(s)” and “method(s)” relate to the rest of the corpus.

The second goal is to investigate how to detect the beginning of the “Materials and Methods” section. The main difficulty is that these two words can also appear in the body of the article (typically in cross-references such as “cf. Materials and Methods”).

The text corpus was built from the folder “Full Text dev set”, which contains 751 articles converted to txt format. The other articles are kept unseen, to test the efficacy of any tools developed later under “real life” conditions.

A few definitions to frame the problem:

A quick exploratory data analysis of the article Abrams, MT et al., 2010 suggested that the “materials” token of the Materials and Methods section title has a specific property: its head_token_id is equal to zero, i.e. the token is its own “head” (the root of its sentence). This suggested that section titles of articles may share this property. This hypothesis is tested in the first part of this report for common section titles and, in a later section, for the lemmas “materials” and “material” (co-occurrences when their head_token_id is 0).

In the later sections, we try different criteria to isolate occurrences of the lemmas “materials”, “material”, “methods” and “method”. We use a technique, co-occurrence analysis, to explore the surroundings of these lemmas in the text and to evaluate whether each criterion discriminates the beginning of the Materials and Methods section from the rest of the article.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a good way to publish informal reports describing data analysis projects as a web page, and to mix code and description in a readable manner. There are even books written in this format, ranging from Data Analysis for the Life Sciences to Text Mining with R: A Tidy Approach, so anybody can understand and build on this work. This report is also code: it can be recompiled with new data (including another model for the annotation of the corpus).

Import and datastructure

library(udpipe)
library(lattice)
library(wordcloud)
library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)

The following lines load the corpus of text, already annotated and tokenized :

x <- readRDS(file = "annotation_lines.rds")
x <- as.data.frame(x)
length(unique(x$doc_id))
## [1] 751

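Here is the row that was meant to show a token “materials” with head_token_id = 0. Note that in the output below, row 7467 is the token “Qiagen” with head_token_id = 27, so the row index does not match this annotation file; the output still shows the columns available for each token: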

x[7467,]
##      doc_id paragraph_id sentence_id
## 7467   doc1          602         841
##                                                                                                                                                                                                                     sentence
## 7467 siRNA concentration in tissues was determined using a modified stem-loop RT-PCR protocol.21 Samples preserved in RNAlater (Qiagen, Valencia, CA) were homogenized in Trizol buffer (Qiagen) at 20 µg/µl in a bead mill.
##      token_id  token  lemma  upos   xpos       feats head_token_id dep_rel
## 7467       30 Qiagen Qiagen PROPN SG-NOM Number=Sing            27   appos
##      deps          misc
## 7467 <NA> SpaceAfter=No

Words with head_token_id == 0

Given the observation that, in “Materials and Methods”, the head_token_id was 0 for the token “Materials”, one idea was to explore which lemmas most commonly have a head_token_id equal to zero across the corpus.

The expected outcome was to find the usual section titles of scientific articles, such as Abstract or Results, among the most common of these lemmas. The goal is to assess whether this is a consistent property of section titles and to uncover potential synonyms of “Materials and Methods”, such as “Experimental section”.

stats <- subset(x, head_token_id == 0) #https://bnosac.github.io/udpipe/docs/doc7.html
stats <- txt_freq(x = stats$lemma)

stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring words with Head_token_id = 0", xlab = "Freq")

Nonetheless, this assumption turned out to be quite naive, as many tokens have this property. Let’s filter for the specific lemmas that correspond to usual section titles, such as abstract or results:

stats<-stats %>% filter(key %in% c("material", "materials", "result", "results", "abstract", "introduction" , "method", "methods", "discussion", "references"))

stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Count of lemma for usual sections name with Head_token_id = 0", xlab = "Freq")

stats
##            key freq     freq_pct
## 1       result 1889 0.3692344981
## 2       method  379 0.0740814583
## 3   discussion  230 0.0449570855
## 4 introduction  166 0.0324472878
## 5     material  121 0.0236513363
## 6      methods   42 0.0082095547
## 7     abstract   20 0.0039093118
## 8      results    1 0.0001954656

Some section titles seem to have the aforementioned property. Nonetheless, the counts do not match the total number of articles in this corpus (751). Taking the token “discussion” as an example: either some articles do not have a Discussion section or, more probably, the token does not always have the property mentioned earlier. We can check this:

occurrences<-which(x$lemma=="discussion")
length(occurrences)
## [1] 900
length(unique(x[occurrences,]$doc_id))
## [1] 709

There are 900 occurrences of the word “discussion” in the whole corpus, spread over 709 articles, whereas only 230 occurrences have a head_token_id of zero. A head_token_id of zero is therefore very unlikely to be sufficient on its own to identify tokens that are section titles.
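The comparison above (occurrences, documents, and root occurrences of one lemma) can be gathered in a single helper. The following is only a minimal sketch in base R on a toy table: the function name `root_coverage` and the toy data are hypothetical, and only the column names `doc_id`, `lemma` and `head_token_id` are taken from the annotation table used in this report.

```r
# Toy annotation table (hypothetical data, udpipe-like column names)
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc3"),
  lemma         = c("discussion", "discussion", "result",
                    "discussion", "result", "result"),
  head_token_id = c("0", "4", "0", "2", "0", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one lemma, count its occurrences, the occurrences
# that are sentence roots (head_token_id equal to "0"), and the number of
# documents in which each of these appears.
root_coverage <- function(annotation, target) {
  hits  <- annotation[which(annotation$lemma == target), ]
  roots <- hits[which(hits$head_token_id == "0"), ]
  data.frame(
    lemma            = target,
    occurrences      = nrow(hits),
    root_occurrences = nrow(roots),
    docs_with_lemma  = length(unique(hits$doc_id)),
    docs_with_root   = length(unique(roots$doc_id)),
    stringsAsFactors = FALSE
  )
}

coverage <- root_coverage(toy, "discussion")
print(coverage)
```

If `docs_with_root` is much smaller than `docs_with_lemma`, as for “discussion” in the corpus, the root criterion alone misses most documents.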

Visualize the most frequent head lemmas of the lemmas material, materials, method and methods

To explore the relationships of the lemmas “material(s)” and “method(s)” with the rest of the corpus, we can analyse which lemmas are most often the head tokens of the lemmas “material” and “materials”. The goals of the analysis are:

  • to observe whether the lemma “material(s)” often has the lemma “material(s)” as its head, and how frequently
  • to observe which other lemmas are commonly the head of the lemma “material(s)”
  • to answer the same questions for the lemmas “method” and “methods”
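The head lookup behind these questions can be sketched on a toy table. This is a hedged, vectorised illustration in base R and not the function used in this report: the name `head_lemma` and the toy data are hypothetical, and it assumes that each token has a `token_id` that is unique within its sentence and that `head_token_id` refers to it (with "0" marking the root).

```r
# Toy annotation table (hypothetical data, udpipe-like column names)
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc1", "doc1"),
  sentence_id   = c(1, 1, 1, 2, 2),
  token_id      = c("1", "2", "3", "1", "2"),
  lemma         = c("material", "and", "method", "see", "method"),
  head_token_id = c("0", "3", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return, for every token, the lemma of its head token.
# A root token (head_token_id "0") is reported with its own lemma.
head_lemma <- function(annotation) {
  key_token <- paste(annotation$doc_id, annotation$sentence_id,
                     annotation$token_id, sep = "\t")
  key_head  <- paste(annotation$doc_id, annotation$sentence_id,
                     annotation$head_token_id, sep = "\t")
  # match() finds, for each token, the row of its head in the same sentence
  out <- annotation$lemma[match(key_head, key_token)]
  is_root <- which(annotation$head_token_id == "0")
  out[is_root] <- annotation$lemma[is_root]
  out
}

toy$head_lemma <- head_lemma(toy)

# Frequency of the head lemmas of the lemma "method"
head_counts <- sort(table(toy$head_lemma[toy$lemma == "method"]),
                    decreasing = TRUE)
print(head_counts)
```

Matching on `token_id` rather than on the position of the token in the sentence is a design choice of this sketch; it avoids relying on tokens being numbered consecutively.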

Lemma material

grep_lemma_head_token_id <- function(index){
  # Return the lemma of the head token of the token stored at row "index" of x,
  # where x is the annotation data frame generated by udpipe.
  # x[index,] holds one token and its associated data: lemma, sentence_id, doc_id, ...
  occurrence <- x[index,]
  head_token_id <- as.numeric(occurrence$head_token_id)
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # A head_token_id of 0 marks the root of the sentence: return the token's own lemma.
  if (head_token_id == 0) {return(occurrence$lemma)}
  # Otherwise take the head_token_id-th token of the same sentence.
  # This assumes tokens are stored in order and numbered 1, 2, 3, ... within each sentence.
  sentence_rows <- which(x$sentence_id == sentence_id & x$doc_id == doc_id)
  return(x[sentence_rows[head_token_id],]$lemma)
}

material_occurrences<-which(x$lemma=="material")
head_token_lemmas<-sapply(material_occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 

stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma material", xlab = "Freq")

Lemma materials

occurrences<-which(x$lemma=="materials") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 

stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma materials", xlab = "Freq")

Lemma method

occurrences<-which(x$lemma=="method") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 


stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma method", xlab = "Freq")

Lemma methods

occurrences<-which(x$lemma=="methods") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 


stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma methods", xlab = "Freq")

head(stats, 10)
##     head_token_lemmas Freq       key
## 59          MATERIALS   72 MATERIALS
## 37            describ   42   describ
## 63            methods   42   methods
## 10                and   22       and
## 48            Immunol   22   Immunol
## 106               use   13       use
## 57           material   11  material
## 58           Material    9  Material
## 51                  j    5         j
## 88            section    5   section

Co-occurrences

Co-occurrences for material(s) and method(s)

In the next sections we test different criteria to discriminate between occurrences of the lemmas “materials” and “material” inside the articles. The idea is to find a criterion that identifies the beginning of the “Materials and Methods” section.

Co-occurrence analysis shows how words are used together, either in the same sentence or next to each other. We use this approach to get a sense of the neighbourhood of the lemmas isolated by each criterion.

There are several types of co-occurrence analysis:

  • Looking at which words are located in the same document/sentence/paragraph.
  • Looking at which words are followed by another word.
  • Looking at which words are in the neighbourhood of the word, i.e. follow it within skipgram number of words.

See the documentation of the udpipe package for these three uses. We use the second approach, as it is the most relevant to our goal and the simplest to interpret. Different skipgram values can be used to look at a more distant or a more immediate neighbourhood.
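To make the “followed by” counting concrete, here is a small base R sketch on a toy vector of lemmas. It is only my reading of the idea (count each ordered pair of words separated by at most `skipgram` other words) and not the implementation of the udpipe package; the function name `followed_by` and the toy data are hypothetical.

```r
# Hypothetical helper: count how often term1 is followed by term2,
# allowing up to `skipgram` words in between (0 = directly adjacent words).
followed_by <- function(tokens, skipgram = 0) {
  n <- length(tokens)
  if (n < 2) {
    return(data.frame(term1 = character(0), term2 = character(0),
                      cooc = numeric(0), stringsAsFactors = FALSE))
  }
  pairs <- list()
  for (gap in 0:skipgram) {
    offset <- gap + 1
    if (n > offset) {
      pairs[[length(pairs) + 1]] <- data.frame(
        term1 = tokens[1:(n - offset)],
        term2 = tokens[(1 + offset):n],
        cooc  = 1,
        stringsAsFactors = FALSE
      )
    }
  }
  pairs <- do.call(rbind, pairs)
  # Sum the unit counts of identical (term1, term2) pairs
  counts <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  counts[order(-counts$cooc, counts$term1, counts$term2), ]
}

tokens <- c("material", "and", "method", "see", "material", "and", "method")

adjacent <- followed_by(tokens, skipgram = 0)  # directly adjacent words only
wider    <- followed_by(tokens, skipgram = 1)  # also pairs one word apart
print(adjacent)
```

With `skipgram = 0` the pair (“material”, “and”) is counted twice in this toy vector; with `skipgram = 1` the pair (“material”, “method”) also appears, which illustrates how a larger window mixes in more distant neighbours.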

The two functions below are meant to save space in the document. The first one plots the word network, a common technique to visualise word co-occurrences, after keeping only the co-occurrences that involve the lemma of interest. The second one returns the first 30 rows of these co-occurrences as a table.

plot_cooccurrence <- function(stats, lemma, title){
  # helper to save space and keep this R Markdown document readable:
  # keep the co-occurrences involving the lemma(s) of interest and plot the first 30 as a network
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  wordnetwork <- head(stats, 30)
  wordnetwork <- graph_from_data_frame(wordnetwork)
  ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "blue", size = 5) +
    theme_graph(base_family = "Helvetica") +
    theme(legend.position = "none") +
    labs(title = title)
}
head_cooc <- function(stats, lemma){
  # helper to save space and keep this R Markdown document readable:
  # return the first 30 co-occurrences involving the lemma(s) of interest
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  head(stats, 30)
}
stats <- cooccurrence(x = x$lemma, skipgram = 0)

Larger skipgram values were not really informative. With skipgram = 0, the data frame stats simply counts how many times each word is directly followed by another.

plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials")

head_cooc(stats, lemma="materials")
##       term1     term2 cooc
## 1   contact materials    1
## 2 materials        22    1
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material")

head_cooc(stats, lemma="material")
##            term1    term2 cooc
## 1       material        .  401
## 2       material       be  341
## 3       material        ,  323
## 4            the material  222
## 5             of material  214
## 6           this material  208
## 7  supplementary material  203
## 8       material      and  173
## 9       material       in  148
## 10      material        (  146
## 11      material       at  131
## 12             t material  126
## 13          test material   80
## 14      material      for   75
## 15          bulk material   67
## 16      material     that   59
## 17      material     have   58
## 18      nanotube material   56
## 19           and material   55
## 20       foreign material   54
## 21      material     with   53
## 22     reference material   48
## 23      material        :   43
## 24      material       to   38
## 25       genetic material   38
## 26      material       on   34
## 27      material      the   32
## 28      material      can   32
## 29             . material   31
## 30      material        [   30
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods")

head_cooc(stats, lemma="methods")
##            term1        term2 cooc
## 1            and      methods  143
## 2        methods            .   55
## 3             in      methods   45
## 4              .      methods   29
## 5        Immunol      methods   25
## 6        methods     Material   20
## 7              ,      methods   17
## 8            see      methods   17
## 9        methods            )   17
## 10       methods          for   15
## 11       methods           to   11
## 12           use      methods   10
## 13       methods            ,   10
## 14       methods            t    9
## 15       methods     Chemical    9
## 16       methods          and    9
## 17       methods       Animal    8
## 18          Mech      methods    8
## 19       methods      section    8
## 20       methods  Preparation    7
## 21       methods Nanoparticle    6
## 22       methods         cell    5
## 23       methods          Mol    4
## 24       methods          2.1    4
## 25          test      methods    4
## 26 supplementary      methods    4
## 27       methods     material    3
## 28       methods    Synthesis    3
## 29       methods          the    3
## 30       methods           NP    3
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method")

head_cooc(stats, lemma="method")
##        term1    term2 cooc
## 1        and   method  518
## 2     method        .  502
## 3     method      for  469
## 4        the   method  449
## 5          .   method  407
## 6     method       of  318
## 7     method       be  280
## 8     method        ,  279
## 9     method       to  220
## 10    method        (  215
## 11      this   method  151
## 12    method      2.1  133
## 13    method      and  133
## 14    method      use  129
## 15    method  describ  127
## 16    method        :  116
## 17         a   method   94
## 18    method       in   83
## 19    method        [   79
## 20         )   method   65
## 21    method     have   61
## 22      test   method   51
## 23    method       as   51
## 24    method        )   49
## 25    method Material   47
## 26 sensitive   method   46
## 27    method     with   44
## 28    method      Mol   41
## 29    method   Animal   38
## 30     vitro   method   38

Co-occurrences, visualization of all the lemmas of interest

plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1    term2 cooc
## 1            and   method  518
## 2         method        .  502
## 3         method      for  469
## 4            the   method  449
## 5              .   method  407
## 6       material        .  401
## 7       material       be  341
## 8       material        ,  323
## 9         method       of  318
## 10        method       be  280
## 11        method        ,  279
## 12           the material  222
## 13        method       to  220
## 14        method        (  215
## 15            of material  214
## 16          this material  208
## 17 supplementary material  203
## 18      material      and  173
## 19          this   method  151
## 20      material       in  148
## 21      material        (  146
## 22           and  methods  143
## 23        method      2.1  133
## 24        method      and  133
## 25      material       at  131
## 26        method      use  129
## 27        method  describ  127
## 28             t material  126
## 29        method        :  116
## 30             a   method   94

Co-occurrences for materials and material when their head_token_id = 0

As in the previous approach, we want to explore the relationships of the different lemmas with their neighbourhood in the text corpus, but we restrict the analysis to the sentences in which the lemma “material” or “materials” is its own head token.

Even if not every “Materials and Methods” section title has a “materials” lemma with a head_token_id equal to zero, the converse could hold: a “materials” lemma with a head_token_id of zero may reliably mark such a title.

Here, by restricting the analysis to the lemmas “materials” and “material” that have a head_token_id = 0, we can visualize their statistical association with other words and understand whether this subset of tokens really delimits the beginning of the “Materials and Methods” section.
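The restriction described above amounts to keeping every token of the sentences whose root is the lemma of interest. A minimal vectorised sketch in base R, on toy data, could look as follows; the function name `sentences_with_root_lemma` and the toy table are hypothetical, and the sketch assumes that a sentence is identified by the pair (`doc_id`, `sentence_id`).

```r
# Toy annotation table (hypothetical data, udpipe-like column names):
#   doc1, sentence 1: "material and method"    (root: material)
#   doc1, sentence 2: "use bulk material"      (root: use)
#   doc2, sentence 1: "supplementary material" (root: material)
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id   = c(1, 1, 1, 2, 2, 2, 1, 1),
  lemma         = c("material", "and", "method", "use", "bulk", "material",
                    "supplementary", "material"),
  head_token_id = c("0", "3", "1", "0", "3", "1", "2", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep all tokens of the sentences in which one of the
# target lemmas is the root (head_token_id equal to "0").
sentences_with_root_lemma <- function(annotation, target) {
  key <- paste(annotation$doc_id, annotation$sentence_id, sep = "\t")
  is_root_target <- annotation$lemma %in% target &
    annotation$head_token_id %in% "0"
  annotation[key %in% unique(key[is_root_target]), ]
}

root_sentences <- sentences_with_root_lemma(toy, "material")
print(root_sentences)
```

The second sentence of doc1 is dropped because “material” appears there with another token as its head; the lemmas of the remaining rows can then be passed to the co-occurrence computation.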

The function create_subset_corpus keeps the sentences in which the lemma “material” or “materials” is the head. The following lines compute the co-occurrences and draw the plot as before.

create_subset_corpus <- function(index){
  # Helper to build a subset of x for the analysis
  # "co-occurrences for materials and material when their head_token_id = 0".
  # x[index,] holds one token and its associated data: lemma, sentence_id, doc_id, ...
  occurrence <- x[index,] # x is the annotation data frame generated by udpipe
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # If the head_token_id of the token is zero, return all the tokens of its sentence
  head_token_id <- occurrence$head_token_id
  if (head_token_id == 0) {return(strip_corpus(doc_id, sentence_id))}
  # Otherwise return NULL, which rbind ignores when the subsets are combined
  return(NULL)
}

strip_corpus <- function(doc_id, sentence_id){
  # Return all the rows (tokens and lemmas) of one sentence of one document,
  # so that co-occurrences can be computed within these sentences.
  sentence_id <- as.numeric(sentence_id)
  subset_article <- x[which(x$sentence_id == sentence_id & x$doc_id == doc_id),]
  return(subset_article)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus) # lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

# NOTE: the three lines below are commented out, so `stats` is not recomputed for this subset.
# The plot and table that follow still use the whole-corpus `stats` and repeat the earlier results.
# stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
# plot_cooccurrence(stats, lemma="materials", title="Co-occurences for the lemma materials \n when its head_token_id is equal to 0")
# head_cooc(stats, lemma="materials")
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma materials is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1    term2 cooc
## 1            and   method  518
## 2         method        .  502
## 3         method      for  469
## 4            the   method  449
## 5              .   method  407
## 6       material        .  401
## 7       material       be  341
## 8       material        ,  323
## 9         method       of  318
## 10        method       be  280
## 11        method        ,  279
## 12           the material  222
## 13        method       to  220
## 14        method        (  215
## 15            of material  214
## 16          this material  208
## 17 supplementary material  203
## 18      material      and  173
## 19          this   method  151
## 20      material       in  148
## 21      material        (  146
## 22           and  methods  143
## 23        method      2.1  133
## 24        method      and  133
## 25      material       at  131
## 26        method      use  129
## 27        method  describ  127
## 28             t material  126
## 29        method        :  116
## 30             a   method   94
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus) # lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="material")
##            term1         term2 cooc
## 1  supplementary      material   26
## 2       material             .   17
## 3              t      material   14
## 4       material           and   12
## 5      reference      material    7
## 6       material             (    6
## 7              .      material    6
## 8    copyrighted      material    6
## 9       material supplementary    5
## 10      material            in    4
## 11      material           for    4
## 12      material             ,    4
## 13     important      material    4
## 14      material          that    3
## 15      material      material    3
## 16      material          with    3
## 17   particulate      material    3
## 18     composite      material    3
## 19      material             t    3
## 20      material             ;    3
## 21    mesoporous      material    3
## 22      material            as    3
## 23      material            of    3
## 24      material             :    2
## 25      material     available    2
## 26      material        within    2
## 27         stent      material    2
## 28    Mesoporous      material    2
## 29        Nature      material    2
## 30      nanotube      material    2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma material is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1         term2 cooc
## 1  supplementary      material   26
## 2       material             .   17
## 3              t      material   14
## 4       material           and   12
## 5      reference      material    7
## 6       material             (    6
## 7              .      material    6
## 8    copyrighted      material    6
## 9       material supplementary    5
## 10           and       methods    4
## 11       methods             t    4
## 12           and        method    4
## 13      material            in    4
## 14      material           for    4
## 15      material             ,    4
## 16     important      material    4
## 17      material          that    3
## 18      material      material    3
## 19      material          with    3
## 20   particulate      material    3
## 21     composite      material    3
## 22      material             t    3
## 23      material             ;    3
## 24    mesoporous      material    3
## 25      material            as    3
## 26      material            of    3
## 27      material             :    2
## 28      material     available    2
## 29      material        within    2
## 30         stent      material    2

Co-occurrences for methods and method when their head_token_id = 0

occurrences<-which(x$lemma=="methods")
subset_corpus<-lapply(occurrences, create_subset_corpus) # lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="methods")
##          term1         term2 cooc
## 1          and       methods   16
## 2      methods      Material    7
## 3         Mech       methods    7
## 4      methods           for    6
## 5      methods             .    4
## 6      methods             ,    4
## 7            ,       methods    3
## 8            .       methods    3
## 9      methods          2013    2
## 10          in       methods    2
## 11     methods         novel    1
## 12       novel       methods    1
## 13     methods          this    1
## 14     methods      Chemical    1
## 15     Immunol       methods    1
## 16     methods          2010    1
## 17     methods           Mol    1
## 18 Purification       methods    1
## 19     methods            28    1
## 20     methods           use    1
## 21  assessment       methods    1
## 22     methods 2008;44:61e72    1
## 23     methods  Nanomaterial    1
## 24     methods     MATERIALS    1
## 25  Microscopy       methods    1
## 26     methods   Polystyrene    1
## 27     \fcount       methods    1
## 28     methods        silver    1
## 29      Emerge       methods    1
## 30     methods           and    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma methods is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##          term1         term2 cooc
## 1          and       methods   16
## 2      methods      Material    7
## 3         Mech       methods    7
## 4      methods           for    6
## 5      methods             .    4
## 6      methods             ,    4
## 7            ,       methods    3
## 8            .       methods    3
## 9      methods          2013    2
## 10          in       methods    2
## 11     methods         novel    1
## 12       novel       methods    1
## 13     methods          this    1
## 14     methods      Chemical    1
## 15     Immunol       methods    1
## 16     methods          2010    1
## 17     methods           Mol    1
## 18 Purification       methods    1
## 19     methods            28    1
## 20     methods           use    1
## 21  assessment       methods    1
## 22     methods 2008;44:61e72    1
## 23     methods  Nanomaterial    1
## 24     methods     MATERIALS    1
## 25  Microscopy       methods    1
## 26     methods   Polystyrene    1
## 27     \fcount       methods    1
## 28     methods        silver    1
## 29      Emerge       methods    1
## 30     methods           and    1
occurrences<-which(x$lemma=="method")
subset_corpus<-lapply(occurrences, create_subset_corpus) # lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="method")
##           term1   term2 cooc
## 1             .  method  115
## 2        method     for   88
## 3        method       :   39
## 4        method      to   34
## 5        method       .   30
## 6             :  method   20
## 7        method      of   19
## 8        method      in   17
## 9        method       ,   17
## 10       method  method   15
## 11       method     the   11
## 12          the  method   11
## 13       method     and   10
## 14         Mech  method    9
## 15    sensitive  method    8
## 16            a  method    7
## 17     standard  method    5
## 18   Analytical  method    5
## 19       method    with    5
## 20 nanotoxicity  method    5
## 21        vitro  method    5
## 22       method    that    4
## 23       method     use    4
## 24          Nat  method    4
## 25  Statistical  method    4
## 26            &  method    4
## 27         this  method    4
## 28       method Enzymol    4
## 29            ;  method    4
## 30       simple  method    4
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma method is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##           term1    term2 cooc
## 1             .   method  115
## 2        method      for   88
## 3        method        :   39
## 4        method       to   34
## 5        method        .   30
## 6             :   method   20
## 7        method       of   19
## 8        method       in   17
## 9        method        ,   17
## 10       method   method   15
## 11       method      the   11
## 12          the   method   11
## 13       method      and   10
## 14         Mech   method    9
## 15    sensitive   method    8
## 16            a   method    7
## 17         test material    6
## 18     material        ,    6
## 19     standard   method    5
## 20   Analytical   method    5
## 21       method     with    5
## 22 nanotoxicity   method    5
## 23        vitro   method    5
## 24       method     that    4
## 25       method      use    4
## 26          Nat   method    4
## 27  Statistical   method    4
## 28            &   method    4
## 29         this   method    4
## 30       method  Enzymol    4

Co-occurrences for materials and material when it is the last lemma of the document

We could assume that the last occurrence of the lemma “materials” in an article corresponds to the section title “Materials and Methods”. As before, we use co-occurrences to see how words are connected to the last occurrence of “materials” in each document, and how often it corresponds to a “Materials and Methods” section.
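Selecting the last occurrence of a lemma in each document can be sketched in a vectorised way with `duplicated(..., fromLast = TRUE)`. This is only an illustration in base R on toy data: the function name `last_occurrence_sentences` and the toy table are hypothetical, and the sketch assumes that the rows of the annotation table are stored in reading order within each document.

```r
# Toy annotation table (hypothetical data, udpipe-like column names):
#   doc1, sentence 1: "material and method"
#   doc1, sentence 2: "this material"
#   doc2, sentence 1: "material be"
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1, 1, 1, 2, 2, 1, 1),
  lemma       = c("material", "and", "method", "this", "material",
                  "material", "be"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each document, find the last row whose lemma is
# `target` and return all the tokens of the sentence that contains it.
last_occurrence_sentences <- function(annotation, target) {
  hits <- which(annotation$lemma == target)
  # With fromLast = TRUE, only the last hit of each document is not a duplicate
  last_hits <- hits[!duplicated(annotation$doc_id[hits], fromLast = TRUE)]
  key <- paste(annotation$doc_id, annotation$sentence_id, sep = "\t")
  annotation[key %in% key[last_hits], ]
}

last_sentences <- last_occurrence_sentences(toy, "material")
print(last_sentences)
```

In the toy table the first sentence of doc1 is dropped, because “material” occurs again later in the same document.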

The function create_subset_corpus_last_lemmas selects the last occurrence of a lemma in each document and returns the tokens of its sentence (using strip_corpus, defined earlier). A graph showing the connections between words in this subset of sentences is then plotted.

create_subset_corpus_last_lemmas <- function(index){
  # Helper to build a subset of x for the analysis
  # "co-occurrences for materials and material when it is the last lemma of the document".
  # x[index,] holds one token and its associated data: lemma, sentence_id, doc_id, ...
  occurrence <- x[index,] # x is the annotation data frame generated by udpipe
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  lemma <- occurrence$lemma
  # Rows of all the occurrences of this lemma in the same document
  occurrences_in_doc <- which(x$doc_id == doc_id & x$lemma == lemma)
  last_occurrence <- occurrences_in_doc[length(occurrences_in_doc)]
  # Return the tokens of the sentence only if this row is the last occurrence
  if (last_occurrence == index) {return(strip_corpus(doc_id, sentence_id))}
  # Otherwise return NULL, which rbind ignores when the subsets are combined
  return(NULL)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas) # lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

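# NOTE: the three lines below are commented out, so `stats` is not recomputed for this subset.
# The plot and table that follow still use the `stats` computed in the previous section
# (lemma method with head_token_id equal to 0) and repeat those results.
# stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
# plot_cooccurrence(stats, lemma="materials", title="Co-occurences for the lemma materials \n when it is the last lemma of the document")
# head_cooc(stats, lemma="materials")
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when materials is the last lemma of the document")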

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##           term1    term2 cooc
## 1             .   method  115
## 2        method      for   88
## 3        method        :   39
## 4        method       to   34
## 5        method        .   30
## 6             :   method   20
## 7        method       of   19
## 8        method       in   17
## 9        method        ,   17
## 10       method   method   15
## 11       method      the   11
## 12          the   method   11
## 13       method      and   10
## 14         Mech   method    9
## 15    sensitive   method    8
## 16            a   method    7
## 17         test material    6
## 18     material        ,    6
## 19     standard   method    5
## 20   Analytical   method    5
## 21       method     with    5
## 22 nanotoxicity   method    5
## 23        vitro   method    5
## 24       method     that    4
## 25       method      use    4
## 26          Nat   method    4
## 27  Statistical   method    4
## 28            &   method    4
## 29         this   method    4
## 30       method  Enzymol    4
occurrences <- which(x$lemma == "material")
# lapply always returns a list (sapply may simplify its result), which do.call(rbind, ...) expects
subset_corpus <- lapply(occurrences, create_subset_corpus_last_lemmas)
subset_corpus <- do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when it is the last lemma of the document")

head_cooc(stats, lemma="material")
##            term1     term2 cooc
## 1       material         .  104
## 2             of  material   97
## 3       material        at   83
## 4  supplementary  material   69
## 5       material        be   55
## 6       material         ,   48
## 7           this  material   40
## 8            the  material   36
## 9       material       and   35
## 10      nanotube  material   27
## 11      material        in   24
## 12      material         :   24
## 13      material available   23
## 14      material       for   21
## 15      material         (   17
## 16      nanosize  material   14
## 17       genetic  material   12
## 18           and  material   12
## 19          test  material   11
## 20     reference  material   10
## 21      material     refer   10
## 22      material      from    9
## 23             t  material    8
## 24      material        as    8
## 25             a  material    8
## 26             .  material    8
## 27      material        to    7
## 28      material        on    7
## 29     nanoscale  material    7
## 30            in  material    6
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when material is the last lemma of the document")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1     term2 cooc
## 1       material         .  104
## 2             of  material   97
## 3       material        at   83
## 4  supplementary  material   69
## 5       material        be   55
## 6       material         ,   48
## 7           this  material   40
## 8            the  material   36
## 9       material       and   35
## 10      nanotube  material   27
## 11      material        in   24
## 12      material         :   24
## 13      material available   23
## 14      material       for   21
## 15      material         (   17
## 16      nanosize  material   14
## 17       genetic  material   12
## 18           and  material   12
## 19          test  material   11
## 20     reference  material   10
## 21      material     refer   10
## 22           and    method   10
## 23        method         ,   10
## 24      material      from    9
## 25             t  material    8
## 26      material        as    8
## 27             a  material    8
## 28             .  material    8
## 29      material        to    7
## 30      material        on    7
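
The last-occurrence selection used in this section can also be written without a per-token loop. The following is a minimal sketch under stated assumptions: the `toy` data frame is made up, it only borrows the column names `doc_id`, `sentence_id` and `lemma` from the udpipe annotation table, and it assumes rows are stored in document order. It is an alternative illustration, not the function used to produce the results above.

```r
# Toy annotation table (made-up rows; column names follow the udpipe output)
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1L, 4L, 9L, 2L, 7L),
  lemma       = c("material", "method", "material", "material", "cell"),
  stringsAsFactors = FALSE
)

# Row indices of every occurrence of the target lemma
hits <- which(toy$lemma == "material")

# Within each document, the largest row index is the last occurrence
last_rows <- as.integer(tapply(hits, toy$doc_id[hits], max))

# Document and sentence holding each last occurrence
last_occurrences <- toy[last_rows, c("doc_id", "sentence_id")]
print(last_occurrences)
```

Each row of `last_occurrences` identifies one sentence per document, which could then be passed to a sentence-extraction helper such as strip_corpus.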

Co-occurrences for lemma materials and material when they are the first lemma of a sentence

Materials

create_subset_corpus <- function(index, target) {
  # Helper used to build a subset of x for the analysis
  # "Co-occurrences for lemma materials and material when they are the first lemma of a sentence".
  # x is the data frame of annotations generated by udpipe; x[index, ] returns one token
  # with all its associated data: lemma, but also sentence_id and doc_id.
  occurrence <- x[index, ]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # Look up the first lemma of this sentence in the same document
  first_lemma <- x[which(x$sentence_id == sentence_id & x$doc_id == doc_id)[1], ]$lemma
  # isTRUE() skips the sentence when the first lemma is missing (NA)
  if (isTRUE(first_lemma == target)) {
    return(strip_corpus(doc_id, sentence_id))
  }
  return(NULL)
}
occurrences <- which(x$lemma == "materials")
# lapply always returns a list (sapply may simplify its result), which do.call(rbind, ...) expects
subset_corpus <- lapply(occurrences, create_subset_corpus, target = "materials")

subset_corpus <- do.call(rbind, subset_corpus)

# NOTE: the recomputation of stats is commented out here, so the plot and the table
# below still use the stats object computed in the previous section.
# stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
# plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for lemma materials when it is the first lemma of a sentence")
#
# head_cooc(stats, lemma="materials")
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma materials is the first lemma of a sentence")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1     term2 cooc
## 1       material         .  104
## 2             of  material   97
## 3       material        at   83
## 4  supplementary  material   69
## 5       material        be   55
## 6       material         ,   48
## 7           this  material   40
## 8            the  material   36
## 9       material       and   35
## 10      nanotube  material   27
## 11      material        in   24
## 12      material         :   24
## 13      material available   23
## 14      material       for   21
## 15      material         (   17
## 16      nanosize  material   14
## 17       genetic  material   12
## 18           and  material   12
## 19          test  material   11
## 20     reference  material   10
## 21      material     refer   10
## 22           and    method   10
## 23        method         ,   10
## 24      material      from    9
## 25             t  material    8
## 26      material        as    8
## 27             a  material    8
## 28             .  material    8
## 29      material        to    7
## 30      material        on    7

Material

occurrences <- which(x$lemma == "material")
# lapply always returns a list (sapply may simplify its result), which do.call(rbind, ...) expects
subset_corpus <- lapply(occurrences, create_subset_corpus, target = "material")

subset_corpus <- do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for lemma material when it is the first lemma of a sentence")

head_cooc(stats, lemma="material")
##               term1            term2 cooc
## 1                 .         material   32
## 2          material              and   27
## 3                 t         material    4
## 4          material               be    4
## 5          material               on    3
## 6           methods         material    3
## 7            method         material    3
## 8          material             once    2
## 9          material        treatment    2
## 10         material          Implant    2
## 11             LDPE         material    2
## 12         material            after    2
## 13         material                &    2
## 14         material                ,    2
## 15         material               in    2
## 16      nano-scaled         material    2
## 17          reagent         material    1
## 18         Organism         material    1
## 19         material             with    1
## 20         material           supply    1
## 21 characterization         material    1
## 22         material characterization    1
## 23       validation         material    1
## 24          Reagent         material    1
## 25            study         material    1
## 26         material      composition    1
## 27          altered         material    1
## 28         material    investigation    1
## 29         Material         material    1
## 30         material         material    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma material is the first lemma of a sentence")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##               term1            term2 cooc
## 1                 .         material   32
## 2          material              and   27
## 3               and           method   14
## 4               and          methods   10
## 5            method              2.1    6
## 6                 t         material    4
## 7          material               be    4
## 8          material               on    3
## 9           methods         material    3
## 10           method         material    3
## 11         material             once    2
## 12           method           Animal    2
## 13         material        treatment    2
## 14         material          Implant    2
## 15             LDPE         material    2
## 16         material            after    2
## 17         material                &    2
## 18                &           method    2
## 19         material                ,    2
## 20         material               in    2
## 21      nano-scaled         material    2
## 22          methods                t    2
## 23          reagent         material    1
## 24           method                .    1
## 25          methods           Silica    1
## 26         Organism         material    1
## 27         material             with    1
## 28         material           supply    1
## 29 characterization         material    1
## 30         material characterization    1
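
The first-lemma filter used in this section can be sketched in a vectorised form as follows. The `toy` data frame is made up and only reuses the column names of the udpipe annotation table; the sketch assumes that tokens are stored in reading order, so that the first row of each (document, sentence) pair is the first token of that sentence. It is an illustration of the idea, not the code that produced the tables above.

```r
# Toy annotation table (made-up rows; column names follow the udpipe output)
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1L, 1L, 2L, 2L, 1L, 1L),
  lemma       = c("material", "and", "the", "material", "material", "be"),
  stringsAsFactors = FALSE
)

# One key per (document, sentence) pair
sentence_key <- paste(toy$doc_id, toy$sentence_id, sep = "_")

# The first row seen for each key is the first token of that sentence
is_first_token <- !duplicated(sentence_key)

# Sentences whose first lemma is the target
target <- "material"
selected_keys <- sentence_key[is_first_token & toy$lemma == target]

# Keep every token of the selected sentences
subset_toy <- toy[sentence_key %in% selected_keys, ]
print(subset_toy)
```

Unlike the per-occurrence loop, this form returns each selected sentence once, even when the target lemma appears several times in it.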

Conclusion